The format of this lab follows the same format as the previous ones. Your goal is to predict the value of the third column (which will be missing on the test set) using the techniques we have learned so far.

Set up

Read in the following libraries and to load the crimes dataset:

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
library(ggmap)
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(FNN)
library(gam)
## Loading required package: splines
## Loaded gam 1.14-4
tract <- read_csv("https://statsmaths.github.io/ml_data/tract_median_income.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   obs_id = col_character(),
##   train_id = col_character(),
##   state = col_character(),
##   county = col_character(),
##   cbsa_name = col_character(),
##   cbsa_type = col_character(),
##   total_population = col_integer(),
##   total_households = col_integer()
## )
## See spec(...) for full column specifications.

This lab will be scored using RMSE.

Lab 09

To start let us do a few plots to help us better picture what we are trying to predict as well as the data subets

qmplot(lon, lat, data = tract,
      color = tract$train_id, size = I(0.5)) 
## Using zoom = 3...
## Map from URL : http://tile.stamen.com/toner-lite/3/0/1.png
## Map from URL : http://tile.stamen.com/toner-lite/3/1/1.png
## Map from URL : http://tile.stamen.com/toner-lite/3/2/1.png
## Map from URL : http://tile.stamen.com/toner-lite/3/0/2.png
## Map from URL : http://tile.stamen.com/toner-lite/3/1/2.png
## Map from URL : http://tile.stamen.com/toner-lite/3/2/2.png
## Map from URL : http://tile.stamen.com/toner-lite/3/0/3.png
## Map from URL : http://tile.stamen.com/toner-lite/3/1/3.png
## Map from URL : http://tile.stamen.com/toner-lite/3/2/3.png
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead

The plot of income quartiles seem to be a good predictor of median incomes. Also, the quartiles are very highly correlated which may indicate we only need one or a few of them to use as a predictor, similar to the age variables (although they do not exhibit as strong correlations across each other as income quartiles)

qplot(income_q1, median_income, color = cbsa_type, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).

income_frame <- data.frame(tract$income_q1,tract$income_q2,tract$income_q3,tract$income_q4,tract$income_p95)
ggpairs(income_frame)

qplot(age_15_18, median_income, color = cbsa_type, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).

age_frame <- data.frame(tract$age_00_02,tract$age_03_04,tract$age_05_05,tract$age_06_08,tract$age_09_11,tract$age_12_14,tract$age_15_18)
ggpairs(age_frame)

Next, I did a few exploratory graphs of the various variables but only kept those that seemed to have some sort of effect (or potential effect) on median income. -We see that as avg. duration is farther from zero, the house values tend to fall. -Also, we see that car_alone and public_transit seem to be good predictors of median income -Healthcare, especially private healthcare also seems to be significant in predicting median income -Housing costs seem to have a direct correlation to median income as well. As does gini, the measure of inequality

qplot(avg_duration, median_income, data=tract) 
## Warning: Removed 12799 rows containing missing values (geom_point).

qplot(car_alone, median_income, data=tract) 
## Warning: Removed 12799 rows containing missing values (geom_point).

qplot(public_transit, median_income, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).

qplot(healthcare_private, median_income, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).

qplot(housing_costs, median_income, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).

qplot(median_income, gini, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).

Next, I decide to include many of the variables identified above in an elastic net and tune the parameter to determine what variables I may want to include in my final model. After cycling through various values of lambda up to 50, I found that only 3 variables came out with significant variables.

X1 <- as.matrix(tract[,10:70])
y1 <- tract$median_income
X_train <- X1[tract$train_id == "train",]
y_train <- y1[tract$train_id == "train"]

model <- cv.glmnet(X_train, y_train, alpha = 0.9)
model$lambda
##  [1] 31804.6840 28979.2436 26404.8074 24059.0771 21921.7350 19974.2685
##  [7] 18199.8095 16582.9886 15109.8016 13767.4885 12544.4228 11430.0108
## [13] 10414.6002  9489.3958  8646.3840  7878.2631  7178.3799  6540.6724
## [19]  5959.6171  5430.1812  4947.7788  4508.2318  4107.7329  3742.8132
## [25]  3410.3120  3107.3493  2831.3010  2579.7760  2350.5958  2141.7753
## [31]  1951.5059  1778.1395  1620.1745  1476.2427  1345.0973  1225.6026
## [37]  1116.7234  1017.5168   927.1234   844.7603   769.7141   701.3349
## [43]   639.0302   582.2605   530.5341   483.4029   440.4587   401.3296
## [49]   365.6766   333.1909   303.5911   276.6209   252.0467   229.6555
coef(model, s = model$lambda[50])
## 62 x 1 sparse Matrix of class "dgCMatrix"
##                                    1
## (Intercept)             463.48519407
## total_population          .         
## total_households          .         
## income_q1                 .         
## income_q2                 0.58688980
## income_q3                 0.42313607
## income_q4                 .         
## income_p95                .         
## median_age                .         
## age_00_02                 .         
## age_03_04                 .         
## age_05_05                 .         
## age_06_08                 .         
## age_09_11                 .         
## age_12_14                 .         
## age_15_18                 .         
## same_house                .         
## same_county               .         
## same_state                .         
## avg_duration              .         
## car_alone                 .         
## carpool                   .         
## public_transit            .         
## walked                    .         
## transit_other             .         
## at_home                   .         
## white_alone               .         
## black_alone               .         
## native_american_alone     .         
## asian_alone               .         
## pacific_alone             .         
## other_alone               .         
## multiracial               .         
## healthcare_private        .         
## healthcare_public         .         
## healthcare_none           .         
## housing_costs             0.01097553
## housing_000000_000000     .         
## housing_010000_014999     .         
## housing_015000_019999     .         
## housing_020000_024999     .         
## housing_025000_029999     .         
## housing_030000_034999     .         
## housing_035000_039999     .         
## housing_040000_049999     .         
## housing_050000_059999     .         
## housing_060000_069999     .         
## housing_070000_079999     .         
## housing_080000_089999     .         
## housing_090000_099999     .         
## housing_100000_124999     .         
## housing_125000_149999     .         
## housing_150000_174999     .         
## housing_175000_199999     .         
## housing_200000_249999     .         
## housing_250000_299999     .         
## housing_300000_399999     .         
## housing_400000_499999     .         
## housing_500000_749999     .         
## housing_750000_999999     .         
## housing_above_1_million   .         
## gini                      .
plot(model)

model2_pred <- predict.cv.glmnet(model, newx=X1, s="lambda.1se")
sqrt(tapply((tract$median_income - model2_pred)^2,
            tract$train_id, mean))
##     test    train    valid 
##       NA 3154.908 3197.087

Next, let us see how well we can capture the lon. and lat. variables using a knn.reg function. We find that the model seems to do a relatively good job capturing the variation in incomes. Below is also code for a function that tunes the k value for our knn reg. It makes sense tune for low values of k since housing values can significant vary from one (lat,lon) combination to another. We find that a value of 5 leads to the lowest validation set RMSE so we use that for our model.

X <- as.matrix(select(tract, lon, lat))
y <- as.matrix(tract$median_income)
X_train <- X[tract$train_id == "train",]
y_train <- y[tract$train_id == "train"]

tuning_function <- function(){
  
 result_final <- 30000
 i_final <- 1
  
 for(i in 1:30){
    mod <- knn.reg(train = X_train, test = X, y = y_train, k = i)
    mod_pred <- mod$pred
    result <- sqrt(tapply((tract$median_income - mod_pred)^2, tract$train_id, mean))[3]
    if(result < result_final){
      result_final <- result
      i_final <- i
    }
  }
  
  return(i_final)
}

result <- tuning_function()
result
## [1] 5
model1 <- knn.reg(train = X_train, test = X, y = y_train, k = result)
model1_pred <- model1$pred
sqrt(tapply((tract$median_income - model1_pred)^2,
            tract$train_id, mean))
##     test    train    valid 
##       NA 15101.52 19458.68

Finally, let us try combining these two in a gam function to evaluate whether it improves our RMSE. I also added a few more variables that very slightly reduced the RMSE of my model but make intuitive sense to include. This will be my final model.

model3 <- gam(median_income ~ s(model1_pred) + s(model2_pred) + s(gini)+ s(housing_above_1_million) + s(healthcare_private), subset = train_id == "train", data=tract)
model3_pred <- predict(model3, newdata = tract)
sqrt(tapply((tract$median_income - model3_pred)^2,
            tract$train_id, mean))
##     test    train    valid 
##       NA 3089.978 3127.678

Submission

The code below assumes that you have added a prediction named median_income_pred to every row of the dataset.

tract$median_income_pred <- model3_pred
submit <- select(tract, obs_id, median_income_pred)
write_csv(submit, "class09_submit.csv")

Now, upload this file (ends with “.Rmd”), the HTML output (ends with “.nb.html” or “.html”), and the csv file to GitHub.